In [3]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline
plt.style.use('ggplot')
This worksheet will walk you through the basic process of preparing a visualization using Python/Pandas/Matplotlib.
For this exercise, we will be creating a line plot comparing the number of hosts infected by the Bedep and ConfickerAB bot families in the Government/Politics sector.
The data we will be using is in the dailybots.csv file, which can be found in the data folder. As is common, we will have to do some data wrangling before we can visualize it. To do that, we'll need to read the data in, filter it down to the sector and bot families of interest, and reshape it into a table like this:
| | date | ConfickerAB | Bedep |
|---|---|---|---|
| 0 | 2016-06-01 | 255 | 430 |
| 1 | 2016-06-02 | 431 | 453 |
The way I chose to do this might be a little more complex, but I wanted you to see all the steps involved.
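For comparison, here is a sketch of a more compact route to the same wide table, using `isin()` and `pivot_table()`. It is not the approach this worksheet follows. The column names (`date`, `botfam`, `industry`, `hosts`) are taken from the cells in this worksheet, while the `Zeus`/`Education` row is made up purely to show the filter doing something.

```python
import pandas as pd

# Small stand-in for dailybots.csv in long format.
# The last row is invented for illustration only.
long_df = pd.DataFrame({
    "date": ["2016-06-01", "2016-06-01", "2016-06-02", "2016-06-02", "2016-06-01"],
    "botfam": ["ConfickerAB", "Bedep", "ConfickerAB", "Bedep", "Zeus"],
    "industry": ["Government/Politics"] * 4 + ["Education"],
    "hosts": [255, 430, 431, 453, 12],
})
long_df["date"] = pd.to_datetime(long_df["date"])

# Keep only the sector and the two bot families of interest.
mask = (
    (long_df["industry"] == "Government/Politics")
    & long_df["botfam"].isin(["ConfickerAB", "Bedep"])
)

# One row per date, one column per bot family.
wide_df = (
    long_df[mask]
    .pivot_table(index="date", columns="botfam", values="hosts", aggfunc="sum")
    .reset_index()
)
wide_df.columns.name = None  # drop the leftover "botfam" label on the columns
```

Note that `pivot_table()` orders the new columns alphabetically, so `Bedep` comes before `ConfickerAB` here.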
Using the pd.read_csv() function, you can read in the data.
In [5]:
data = pd.read_csv('../../data/dailybots.csv')
data.head()
Out[5]:
In [6]:
filteredData = data[data['industry'] == "Government/Politics"]
filteredData.head()
Out[6]:
Next, I created a second DataFrame which only contains the information from the ConfickerAB botnet. I also reduced the columns to the date and host count. You'll need to rename the host count so that you can merge the other data set later.
In [7]:
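# .copy() makes an independent DataFrame, so the assignments below
# do not trigger a SettingWithCopyWarning
filteredData2 = filteredData[filteredData['botfam'] == 'ConfickerAB'][['date', 'hosts']].copy()
filteredData2.columns = ['date', 'ConfickerAB']
filteredData2['date'] = pd.to_datetime(filteredData2['date'])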
filteredData2.head()
Out[7]:
Repeat this process for the Bedep botfam in a separate DataFrame.
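Since the same three steps (filter, rename, convert the date) are repeated for each bot family, they could be wrapped in a small helper. The function name `hosts_for_family` is hypothetical and not part of the worksheet; the sample rows reuse the values from the target table above.

```python
import pandas as pd

def hosts_for_family(df, family):
    """Return a two-column frame (date, <family>) for one bot family."""
    # .loc with a row mask and a column list selects both in one step;
    # .copy() avoids modifying a view of the original frame.
    subset = df.loc[df["botfam"] == family, ["date", "hosts"]].copy()
    subset["date"] = pd.to_datetime(subset["date"])
    # Name the host count after the family so the columns stay distinct after a merge.
    return subset.rename(columns={"hosts": family})

# Tiny stand-in for the filtered Government/Politics data.
sample = pd.DataFrame({
    "date": ["2016-06-01", "2016-06-01", "2016-06-02", "2016-06-02"],
    "botfam": ["ConfickerAB", "Bedep", "ConfickerAB", "Bedep"],
    "hosts": [255, 430, 431, 453],
})

conficker = hosts_for_family(sample, "ConfickerAB")
bedep = hosts_for_family(sample, "Bedep")
```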
Next, you'll need to merge the DataFrames so that you end up with a single DataFrame with three columns: the date, the ConfickerAB count, and the Bedep count. Pandas has a .merge() function, which is documented here: http://pandas.pydata.org/pandas-docs/stable/merging.html
In [8]:
# Same steps as for ConfickerAB; .copy() avoids a SettingWithCopyWarning
filteredData3 = filteredData[filteredData['botfam'] == 'Bedep'][['date', 'hosts']].copy()
filteredData3.columns = ['date', 'Bedep']
filteredData3['date'] = pd.to_datetime(filteredData3['date'])
finalData = pd.merge(filteredData2, filteredData3, on='date', how='left')
finalData.head()
Out[8]:
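The merge above uses `how='left'`, which keeps every date from the ConfickerAB frame and fills in `NaN` where Bedep has no matching row. The sketch below contrasts that with `how='outer'` on toy data. The rows for 2016-06-03 and 2016-06-04 are made up so that the two frames do not line up perfectly.

```python
import pandas as pd

left = pd.DataFrame({
    "date": pd.to_datetime(["2016-06-01", "2016-06-02", "2016-06-03"]),
    "ConfickerAB": [255, 431, 400],  # 400 is an invented value
})
right = pd.DataFrame({
    "date": pd.to_datetime(["2016-06-01", "2016-06-02", "2016-06-04"]),
    "Bedep": [430, 453, 410],  # 410 is an invented value
})

# Left join: keeps the three dates from `left`; Bedep is NaN for 2016-06-03.
left_join = pd.merge(left, right, on="date", how="left")

# Outer join: keeps the union of dates, with NaN wherever one side is missing.
outer_join = pd.merge(left, right, on="date", how="outer")
```

Which one is appropriate depends on whether dates that appear for only one bot family should show up in the plot.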
In [9]:
finalData.plot(kind='line' )
Out[9]:
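The default plot doesn't look horrible, but there are certainly some improvements which can be made. Try the following: use the date as the x-axis, reposition the legend, change the figure size, and plot each bot family in its own subplot.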
The Pandas visualization documentation has many examples: http://pandas.pydata.org/pandas-docs/version/0.18.1/visualization.html
In [10]:
finalData.set_index('date', inplace=True)
finalData.plot(kind='line' )
Out[10]:
In [9]:
nicePlot = finalData.plot( kind="line")
nicePlot.legend(loc='upper center', bbox_to_anchor=(0.5, 1.05),
ncol=3, fancybox=True, shadow=False)
Out[9]:
In [10]:
finalData.plot( kind="line", figsize=(60,40))
Out[10]:
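`DataFrame.plot()` returns a Matplotlib `Axes` object, so further refinements such as a title and axis labels can be applied to its return value. Here is a minimal sketch on a stand-in DataFrame; the third row of values and the label text are assumptions chosen for illustration. Note that `figsize` is given in inches, so modest values are usually enough for on-screen viewing.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Stand-in for finalData after set_index('date'); the last row is invented.
demo = pd.DataFrame(
    {"ConfickerAB": [255, 431, 400], "Bedep": [430, 453, 410]},
    index=pd.to_datetime(["2016-06-01", "2016-06-02", "2016-06-03"]),
)
demo.index.name = "date"

# plot() returns the Axes it drew on, which can then be customised.
ax = demo.plot(kind="line", figsize=(10, 6), linewidth=2)
ax.set_title("Infected hosts in Government/Politics")
ax.set_xlabel("Date")
ax.set_ylabel("Hosts")
ax.legend(loc="upper left")
```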
In [17]:
fig, axes = plt.subplots(nrows=1, ncols=2)
# Pass ax= explicitly so each series lands on a known subplot
finalData['Bedep'].plot(ax=axes[0])
finalData['ConfickerAB'].plot(ax=axes[1])
Out[17]:
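When the two bot families sit side by side, a shared y-axis and per-panel titles make them easier to compare. This is a sketch under the same assumptions as before: a stand-in DataFrame whose last row is invented, and `sharey=True` as one possible design choice rather than the worksheet's own.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Stand-in for finalData indexed by date; the last row is invented.
demo = pd.DataFrame(
    {"ConfickerAB": [255, 431, 400], "Bedep": [430, 453, 410]},
    index=pd.to_datetime(["2016-06-01", "2016-06-02", "2016-06-03"]),
)

# sharey=True puts both panels on the same vertical scale.
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(12, 4), sharey=True)

# Draw each family on its own Axes and label the panel with the family name.
for ax, family in zip(axes, ["ConfickerAB", "Bedep"]):
    demo[family].plot(ax=ax, title=family)
    ax.set_ylabel("Hosts")

fig.tight_layout()  # keep titles and tick labels from overlapping
```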
In [58]:
from bokeh.plotting import output_notebook
output_notebook()
In [59]:
from bokeh.charts import TimeSeries
from bokeh.io import show
In [60]:
linechart = TimeSeries( data=finalData,
title="ConfickerAB Hosts",
legend="top_left" )
show( linechart )
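The `bokeh.charts` module used above was split out of Bokeh into the separate `bkcharts` package and is not part of recent Bokeh releases, so the cells above may fail to import on a current installation. A roughly equivalent chart can be sketched with `bokeh.plotting` alone. This assumes a Bokeh version that supports the `legend_label` argument, and it uses a stand-in DataFrame whose last row is invented.

```python
import pandas as pd
from bokeh.plotting import figure

# Stand-in for finalData indexed by date; the last row is invented.
demo = pd.DataFrame(
    {"ConfickerAB": [255, 431, 400], "Bedep": [430, 453, 410]},
    index=pd.to_datetime(["2016-06-01", "2016-06-02", "2016-06-03"]),
)

# x_axis_type="datetime" makes Bokeh format the x-axis ticks as dates.
p = figure(x_axis_type="datetime", title="ConfickerAB and Bedep Hosts")

# One line glyph per bot family; legend_label adds each to the legend.
p.line(demo.index, demo["ConfickerAB"], legend_label="ConfickerAB", color="firebrick")
p.line(demo.index, demo["Bedep"], legend_label="Bedep", color="navy")
p.legend.location = "top_left"

# In a notebook, after output_notebook(), display with:
#   from bokeh.io import show
#   show(p)
```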